choose “R-4.2.3-arm64.pkg” for Apple silicon Macs (M1 or higher)
otherwise choose “R-4.2.3.pkg”.
also install the Xcode Command Line Tools, if you haven’t already, by running xcode-select --install in Terminal. They help compile R packages that rely on other languages such as C++ or Fortran.
Windows:
click “install R for the first time”
also click “Rtools” and install it. Rtools helps compile R packages that rely on other languages such as C++ or Fortran.
Two components:
R console
run commands in it to produce the corresponding output.
analogy: musical instrument 🎻
R script:
record the commands in plain text; this makes it easier to share your code and for other people to reproduce your results.
analogy: written sheet music with notations 🎼
What is RStudio?
An integrated development environment (IDE) for R. It includes:
R console
syntax-highlighting editor
tools for plotting, debugging, and workspace management
TL;DR - RStudio provides various tools that make R programming easier.
Change Appearance and Pane Layout: Tools > Global Options... > Appearance/Pane Layout
Don’t save workspace to .RData on exit: Tools > Global Options... > General > “Save workspace to .RData on exit” > “Never”
Don’t restore .RData into workspace: Tools > Global Options... > General > uncheck “Restore .RData into workspace at startup”
Reloading a saved workspace may be convenient for you, but it makes your code less reproducible on other people’s machines.
Install Packages
What is a package?
a collection of R functions, compiled code, and datasets bundled for reuse.
Important packages for this course:
tidyverse: a bundle of packages for data wrangling.
rmarkdown: write documents that embed R code as well as its output.
qsslearnr: interactive tutorials for Quantitative Social Science
Two actions
Installing a package: download the package to your computer.
Loading a package: tell R to use the package in the current session.
you only need to install a package once, but you need to load it in every new R session.
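The install-once, load-every-session pattern can be sketched as below. `ensure_loaded()` is a hypothetical helper written for this illustration, not a function from R or any package. It is demonstrated on `stats`, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not part of base R): install a package only when it is
# missing, then load it for the current session.
ensure_loaded <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # one-time download (from CRAN by default)
  }
  library(pkg, character.only = TRUE)  # has to be repeated in every new session
  invisible(TRUE)
}

ensure_loaded("stats")  # `stats` ships with R, so no installation is triggered
```

In a course script the same call could be made with "tidyverse" or "rmarkdown"; the download would then happen only the first time.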
Install packages from different sources
install.packages("tidyverse") downloads from CRAN by default.
remotes::install_github("rstudio/learnr") downloads from GitHub.
CRAN is the official R package repository: packages must pass automated checks and comply with CRAN's submission policies before they are accepted. Packages on GitHub are published directly by individual developers or small teams and may not have gone through the same level of testing and quality control as CRAN packages.
## Interactive Tutorials for Quantitative Social Science
## Written by Matthew Blackwell
## See here: https://github.com/mattblackwell/qsslearnr

## 1. Install `remotes` package:
install.packages("remotes")

## 2. Install the following packages by running:
remotes::install_github("kosukeimai/qss-package", build_vignettes = TRUE)
remotes::install_github("rstudio/learnr")
remotes::install_github("rstudio-education/gradethis")
remotes::install_github("mattblackwell/qsslearnr")

## 3. See all available tutorials for QSS
learnr::run_tutorial(package = "qsslearnr")

## 4. Run a particular tutorial
learnr::run_tutorial("00-intro", package = "qsslearnr")

## 5. If you have problems generating PDF from Rmarkdown
## install tinytex by running (takes some time!):
# install.packages("tinytex")
# tinytex::install_tinytex()
For Mac users, installing the qss package may sometimes fail because pandoc or curl is missing or outdated on your Mac (if you don’t encounter these problems, you can skip this part).
pandoc is used to convert documents to other types, e.g. convert .html to .pdf or .docx.
To install or upgrade pandoc or curl, first install the package manager Homebrew, and then run brew install pandoc or brew install curl in the macOS Terminal.
QSS Exercise
Agenda
Introduction
R and RStudio
QSS Exercise
QSS Tutorial 0
Any questions?
Bias in Self-Reported Turnout
Use read.csv() to load the voter turnout data
If your datasets are stored in other formats, such as .xlsx, .sav or .dta, you need external packages such as readxl, foreign or haven to help you load your datasets.
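As a minimal sketch of how the file format determines the reader, the hypothetical helper `read_any()` below (a name made up for this illustration) picks a function based on the file extension. `readxl::read_excel()`, `haven::read_dta()` and `haven::read_sav()` are only called if a file of that type is passed, so the demonstration with a temporary .csv file runs with base R alone.

```r
# Hypothetical helper: choose a reader based on the file extension.
read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv  = read.csv(path),            # base R
    xlsx = readxl::read_excel(path),  # Excel files; requires the readxl package
    dta  = haven::read_dta(path),     # Stata files; requires the haven package
    sav  = haven::read_sav(path),     # SPSS files; requires the haven package
    stop("unsupported file type: ", ext)
  )
}

# Demonstration with a temporary .csv file, so no extra package is needed
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(year = c(1980, 1982), ANES = c(71, 60)), tmp, row.names = FALSE)
demo_data <- read_any(tmp)
```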
File management practices in R
Option 1: Use setwd() to open target folder as current working directory in R.
Option 2: Open your target folder as an R project.
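Option 1 can be sketched as follows. The folder name `qss-demo` and the one-row data file are made up for this illustration; the example creates them in a temporary directory so that it runs anywhere.

```r
# Create a throwaway folder with a small data file, standing in for a course folder
project_dir <- file.path(tempdir(), "qss-demo")  # hypothetical folder name
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(year = 1980, ANES = 71),
          file.path(project_dir, "data", "turnout.csv"), row.names = FALSE)

# Option 1: make the folder the working directory, then use a relative path
old_wd <- setwd(project_dir)  # setwd() returns the previous working directory
turnout_demo <- read.csv("./data/turnout.csv")
setwd(old_wd)                 # restore the original working directory
```

With Option 2, RStudio sets the working directory to the project folder when the project is opened, so the same relative path works without calling setwd().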
In this exercise, we will also explore how to visualize data in R in two different ways:
R base graphics: requires more code to generate plots, but is more flexible.
ggplot2: requires less code to generate plots, but is more restrictive, e.g. we often need to reshape the data before plotting.
Variable   Description
year       election year
ANES       ANES estimated turnout rate
VEP        voting eligible population (in thousands)
VAP        voting age population (in thousands)
total      total ballots cast for highest office (in thousands)
felons     total ineligible felons (in thousands)
noncit     total noncitizens (in thousands)
overseas   total eligible overseas voters (in thousands)
osvoters   total ballots counted by overseas voters (in thousands)
Load the data into R and check the dimensions of the data. Also, obtain a summary of the data. How many observations are there? What is the range of years covered in this data set?
turnout <- read.csv("./data/turnout.csv") # load the dataset as a data.frame in R
dim(turnout) # the dimensions of the dataset: 14 rows (observations) x 9 columns (variables)
[1] 14 9
# we can also use `nrow()` to solely fetch the number of rows
# and `ncol()` to solely fetch the number of columns
head(turnout, n = 5) # the first 5 rows of the dataset
year VEP VAP total ANES felons noncit overseas osvoters
1 1980 159635 164445 86515 71 802 5756 1803 NA
2 1982 160467 166028 67616 60 960 6641 1982 NA
3 1984 167702 173995 92653 74 1165 7482 2361 NA
4 1986 170396 177922 64991 53 1367 8362 2216 NA
5 1988 173579 181955 91595 70 1594 9280 2257 NA
summary(turnout) # get the range and quartiles of each variable
year VEP VAP total
Min. :1980 Min. :159635 Min. :164445 Min. : 64991
1st Qu.:1986 1st Qu.:171192 1st Qu.:178930 1st Qu.: 73179
Median :1993 Median :181140 Median :193018 Median : 89055
Mean :1993 Mean :182640 Mean :194226 Mean : 89778
3rd Qu.:2000 3rd Qu.:193353 3rd Qu.:209296 3rd Qu.:102370
Max. :2008 Max. :213314 Max. :230872 Max. :131304
ANES felons noncit overseas osvoters
Min. :47.00 Min. : 802 Min. : 5756 Min. :1803 Min. :263
1st Qu.:57.00 1st Qu.:1424 1st Qu.: 8592 1st Qu.:2236 1st Qu.:263
Median :70.50 Median :2312 Median :11972 Median :2458 Median :263
Mean :65.79 Mean :2177 Mean :12229 Mean :2746 Mean :263
3rd Qu.:73.75 3rd Qu.:3042 3rd Qu.:15910 3rd Qu.:2937 3rd Qu.:263
Max. :78.00 Max. :3168 Max. :19392 Max. :4972 Max. :263
NA's :13
turnout$year # we use `$` to get specific variables from the data.frame; this is a vector of year
# alternatively, we can use `turnout[, "year"]` to do the same thing.
length(turnout$year) # the length of `year` vector is 14.
[1] 14
Calculate the turnout rate based on the voting age population or VAP. Note that for this data set, we must add the total number of eligible overseas voters since the VAP variable does not include these individuals in the count. Next, calculate the turnout rate using the voting eligible population or VEP. What difference do you observe? (Additionally, how can we visualize the temporal change in this difference?)
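A minimal sketch of the two calculations is given below, assuming the column names shown in the head(turnout) output above. To stay self-contained it uses only the first five rows copied from that output; `turnout_head` and `add_turnout_rates()` are names made up for this illustration.

```r
# First five rows of the data, copied from the head(turnout) output above
turnout_head <- data.frame(
  year     = c(1980, 1982, 1984, 1986, 1988),
  VEP      = c(159635, 160467, 167702, 170396, 173579),
  VAP      = c(164445, 166028, 173995, 177922, 181955),
  total    = c(86515, 67616, 92653, 64991, 91595),
  overseas = c(1803, 1982, 2361, 2216, 2257)
)

# Hypothetical helper: append VAP- and VEP-based turnout rates (in percent)
add_turnout_rates <- function(df) {
  # VAP excludes overseas voters, so eligible overseas voters are added to it
  df$tr_vap <- df$total / (df$VAP + df$overseas) * 100
  df$tr_vep <- df$total / df$VEP * 100
  df
}

turnout_head <- add_turnout_rates(turnout_head)
round(turnout_head[, c("year", "tr_vap", "tr_vep")], 2)
```

Applied to the full data frame, `turnout <- add_turnout_rates(turnout)` would create the `tr_vap` and `tr_vep` columns that the plotting code refers to.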
plot(
  tr_vep - tr_vap ~ year, # variable on y-axis ~ variable on x-axis
  data = turnout, # the dataset from which we extract variables for plotting
  type = "l", # the type of plot; we use lines to visualize the time series
  xlab = "Year", ylab = "VEP-based TR - VAP-based TR (%)", # labels of x- and y-axis
  xaxt = "n" # suppress the default x-axis so that it can be redrawn below
)
axis(
  side = 1, # redraw the breaks of x-axis to reflect four-year election cycle
  at = seq(1980, 2008, 4) # highlight presidential election years
)
Compute the differences between the VAP and ANES estimates of turnout rate. How big is the difference on average? What is the range of the differences? Conduct the same comparison for the VEP and ANES estimates of voter turnout. Briefly comment on the results.
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.581 15.267 16.893 16.836 18.529 22.489
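The code that produced the summary above is not shown. One way to compute such differences is sketched below, under the assumption that the difference is taken as the ANES estimate minus the calculated turnout rate. Only the first five rows from the head(turnout) output are used, so the numbers will not match the full-data summary; `turnout_head`, `diff_vap` and `diff_vep` are names made up for this illustration.

```r
# First five rows of the data, copied from the head(turnout) output
turnout_head <- data.frame(
  year     = c(1980, 1982, 1984, 1986, 1988),
  VEP      = c(159635, 160467, 167702, 170396, 173579),
  VAP      = c(164445, 166028, 173995, 177922, 181955),
  total    = c(86515, 67616, 92653, 64991, 91595),
  ANES     = c(71, 60, 74, 53, 70),
  overseas = c(1803, 1982, 2361, 2216, 2257)
)

tr_vap <- turnout_head$total / (turnout_head$VAP + turnout_head$overseas) * 100
tr_vep <- turnout_head$total / turnout_head$VEP * 100

# Assumed definition of the bias: ANES estimate minus calculated turnout rate
diff_vap <- turnout_head$ANES - tr_vap
diff_vep <- turnout_head$ANES - tr_vep

summary(diff_vap) # mean and quartiles of the ANES - VAP differences
range(diff_vap)   # smallest and largest difference
summary(diff_vep) # same comparison for the VEP-based rate
```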
Compare the VEP turnout rate with the ANES turnout rate separately for presidential elections and midterm elections. Note that the data set excludes the year 2006. Does the bias of the ANES estimates vary across election types?
turnout$midterm <- ifelse(turnout$year %% 4 != 0, 1, 0) # presidential elections are held in years divisible by 4; every other year in this data set is a midterm election year
turnout$tr_vep[turnout$midterm == 0] # VEP turnout rate in presidential election years
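Once the `midterm` indicator exists, the comparison by election type can be sketched with `tapply()`. This is one possible approach, not necessarily the one used in the course; it again uses only the first five rows from the head(turnout) output, and `turnout_head`, `bias_vep` and `bias_by_type` are names made up for this illustration.

```r
# First five rows of the data, copied from the head(turnout) output
turnout_head <- data.frame(
  year  = c(1980, 1982, 1984, 1986, 1988),
  VEP   = c(159635, 160467, 167702, 170396, 173579),
  total = c(86515, 67616, 92653, 64991, 91595),
  ANES  = c(71, 60, 74, 53, 70)
)

turnout_head$tr_vep   <- turnout_head$total / turnout_head$VEP * 100
turnout_head$midterm  <- ifelse(turnout_head$year %% 4 != 0, 1, 0)
turnout_head$bias_vep <- turnout_head$ANES - turnout_head$tr_vep # assumed definition of bias

# Mean bias by election type: "0" = presidential, "1" = midterm
bias_by_type <- tapply(turnout_head$bias_vep, turnout_head$midterm, mean)
bias_by_type
```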
Divide the data into half by election years such that you subset the data into two periods. Calculate the difference between the VEP turnout rate and the ANES turnout rate separately for each year within each period. Has the bias of ANES increased over time?
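One way to split the data into two periods is sketched below: sort by year, label the first half of the rows "early" and the rest "late", and compare the bias within each group. `bias_by_period()` is a hypothetical helper, the bias is assumed to be ANES minus the VEP turnout rate, and the demonstration uses only the first five rows from the head(turnout) output (so the split is 3 and 2 rows rather than 7 and 7).

```r
# First five rows of the data, copied from the head(turnout) output
turnout_head <- data.frame(
  year  = c(1980, 1982, 1984, 1986, 1988),
  VEP   = c(159635, 160467, 167702, 170396, 173579),
  total = c(86515, 67616, 92653, 64991, 91595),
  ANES  = c(71, 60, 74, 53, 70)
)

# Hypothetical helper: per-year bias and its mean within each half of the data
bias_by_period <- function(df) {
  df <- df[order(df$year), ]                      # make sure rows are in time order
  df$bias <- df$ANES - df$total / df$VEP * 100    # assumed definition of bias
  half <- ceiling(nrow(df) / 2)                   # size of the first period
  df$period <- ifelse(seq_len(nrow(df)) <= half, "early", "late")
  list(
    per_year = df[, c("year", "period", "bias")], # difference for each year
    mean     = tapply(df$bias, df$period, mean)   # average difference per period
  )
}

result <- bias_by_period(turnout_head)
result$per_year
result$mean
```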
ANES does not interview prisoners and overseas voters. Calculate an adjustment to the 2008 VAP turnout rate. Begin by subtracting the total number of ineligible felons and noncitizens from the VAP to calculate an adjusted VAP. Next, calculate an adjusted VAP turnout rate, taking care to subtract the number of overseas ballots counted from the total ballots in 2008. Compare the adjusted VAP turnout with the unadjusted VAP, VEP, and the ANES turnout rate. Briefly discuss the results. (Additionally, how can we visualize the comparison among the 4 types of turnout rate?)
# note: `overseas` is the number of eligible overseas voters; the overseas ballots counted (`osvoters`) mentioned in the exercise are non-missing in only one row of this data set (see the summary output)
turnout$adj_tr_vap <- (turnout$total - turnout$overseas) / (turnout$VAP - turnout$felons - turnout$noncit) * 100

# install.packages("tidyverse")
library(tidyverse)

turnout %>% # we use the pipe operator `%>%` to avoid repeatedly referring to the dataset in subsequent functions
  pivot_longer( # we use `pivot_longer()` to reshape data from wide form to long form, which is more efficient for visualization; see `vignette("pivot")`.
    c("tr_vap", "tr_vep", "ANES", "adj_tr_vap"), # we plan to convert these columns into one variable
    names_to = "type", values_to = "turnout_rate"
  ) %>%
  ggplot(aes(x = year, y = turnout_rate, group = type), data = .) +
  geom_line(aes(color = type)) +
  scale_x_continuous(breaks = seq(1980, 2008, 4)) + # redraw x-axis breaks to match presidential election cycle
  scale_color_discrete(name = "Type", labels = c("Adj. VAP", "ANES", "VAP", "VEP")) +
  labs(x = "Year", y = "Turnout Rate (%)") +
  theme_bw()